Skip to content

Defend against ACP server reporting used > size#364

Closed
timvisher-dd wants to merge 10 commits intoxenodium:mainfrom
timvisher-dd:timvisher-dd/tests/add-shell-usage-regression-tests
Closed

Defend against ACP server reporting used > size#364
timvisher-dd wants to merge 10 commits intoxenodium:mainfrom
timvisher-dd:timvisher-dd/tests/add-shell-usage-regression-tests

Conversation

@timvisher-dd
Copy link
Copy Markdown
Contributor

@timvisher-dd timvisher-dd commented Mar 3, 2026

The ACP server (claude-agent-acp) has a bug where model switches cause used to exceed size in session/update notifications. For example, switching from Opus 1M to Sonnet 200k drops size to 200000 while used keeps growing past it (observed: 419574/200000 = 209.8%). This results in nonsensical context indicators and percentages e.g. (from a real session)

 Context: 420k/200k (209.8%)
  Tokens: 32 in · 11k out · 11m cached (11m total)
    Cost: USD75.96

While I intend to get a fix for this improper reporting into claude-agent-acp, agent-shell should be robust against it. To that end, when the garbage usage data is observed, the UI now signals unreliable data instead of showing nonsense:

  • Context indicator shows ? with warning face when used > size
  • Formatted usage shows raw numbers with (?) instead of a bogus percentage
  • A regression test replays the real observed ACP traffic from the model-switch scenario so this class of bug is caught going forward

This also adds comprehensive ERT test coverage for agent-shell-usage.el: notification updates, indicator scaling/colors, compaction replay, token saving, and number formatting.

For the claude-agent-acp side of this see agentclientprotocol/claude-agent-acp#412

Test plan

  • All 21 ERT tests pass
  • checkdoc clean on both files
  • byte-compile clean on both files
  • Manual baking verification

@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from e528a97 to 3e99679 Compare March 3, 2026 19:49
@timvisher-dd timvisher-dd changed the title Add usage tracking tests and fix existing test failures Add usage tracking and context indicator regression tests Mar 3, 2026
@timvisher-dd timvisher-dd marked this pull request as ready for review March 3, 2026 20:08
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from 3e99679 to 6d914d4 Compare March 11, 2026 17:31
@timvisher-dd timvisher-dd changed the title Add usage tracking and context indicator regression tests Defend against ACP server reporting used > size Mar 11, 2026
@timvisher-dd timvisher-dd marked this pull request as draft March 12, 2026 15:54
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from 6d914d4 to c98e427 Compare March 13, 2026 01:20
@timvisher-dd timvisher-dd marked this pull request as ready for review March 14, 2026 22:00
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch 2 times, most recently from e2ce5c0 to f64cdbd Compare March 14, 2026 22:02
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from f64cdbd to 2f153b2 Compare March 14, 2026 23:53
@timvisher-dd timvisher-dd reopened this Mar 15, 2026
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from 2f153b2 to 8b63348 Compare March 15, 2026 16:05
timvisher-dd and others added 10 commits March 15, 2026 14:49
CI workflow runs byte-compilation and ERT tests on push/PR using
GitHub Actions with deps checked out from timvisher-dd/acp.el-plus
and xenodium/shell-maker.

bin/test parses ci.yml with yq so local runs stay in sync with CI
automatically. It symlinks local dependency checkouts into deps/ to
match the CI layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New library for sending desktop notifications from Emacs.

In GUI mode on macOS, uses native UNUserNotificationCenter via a
dynamic module (agent-shell-alert-mac.dylib) compiled JIT on first
use (inspired by vterm). When compilation fails (e.g. missing Xcode
CLI tools), a message recommends `xcode-select --install`.

In terminal mode, auto-detects the host terminal emulator and sends
the appropriate OSC escape sequence:
- OSC 9: iTerm2, Ghostty, WezTerm, foot, mintty, ConEmu
- OSC 99: kitty
- OSC 777: urxvt, VTE-based terminals

Inside tmux, wraps in DCS passthrough (checking allow-passthrough
first). Falls back to osascript on macOS when the terminal is
unknown or tmux passthrough is not enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follows the same pattern as acp.el: a boolean toggle
(agent-shell-logging-enabled, off by default), a per-shell log
buffer stored in state, and label+format-string logging. Adds log
calls to idle notification start/cancel/fire for observability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After each agent turn completes, a 30s timer starts. Any user input
in the buffer cancels it; otherwise it fires a desktop notification
via agent-shell-alert. The echo area message is only shown when the
shell buffer is not the active buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate that agent-shell-buffers and agent-shell-project-buffers
reflect (buffer-list) ordering correctly: switch-to-buffer and
select-window promote, with-current-buffer does not, bury-buffer
demotes, and project filtering preserves order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fails PRs that modify .el files or tests/ without also updating
README.org, ensuring the soft-fork features list stays current.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive ERT tests for agent-shell-usage.el covering
notification updates, context indicator scaling/colors, compaction
replay, token saving, and number formatting.

The ACP server has a bug where model switches cause used to exceed
size in session/update notifications. Rather than clamping, signal
unreliable data: indicator shows ? with warning face, format shows
(?) instead of a bogus percentage. A regression test replays real
observed traffic from the Opus 1M -> Sonnet 200k switch scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timvisher-dd timvisher-dd force-pushed the timvisher-dd/tests/add-shell-usage-regression-tests branch from 8b63348 to b9ca659 Compare March 15, 2026 20:03
@timvisher-dd timvisher-dd deleted the timvisher-dd/tests/add-shell-usage-regression-tests branch March 15, 2026 20:04
SteffenDE added a commit to agentclientprotocol/claude-agent-acp that referenced this pull request Mar 30, 2026
The `usage_update` notification reports `size: 200000` even when the
active model has a 1M context window (e.g. `opus[1m]`), causing clients
to display incorrect context utilization (e.g. `689k/200k (344.3%)`
instead of `689k/1000k (68.9%)`).

Four bugs fixed:

- **Min across all models**: The original code used `Math.min` across
all `modelUsage` entries, so subagent models (Sonnet/Haiku with 200k
windows) dragged down the reported size for the main Opus 1M model. Now
tracks the top-level assistant model and looks up its context window
specifically.
- **Model name mismatch**: The SDK's streaming path keys `modelUsage` by
the requested model alias (e.g. `claude-opus-4-6`) while
`BetaMessage.model` on assistant messages has the resolved API response
model (e.g. `claude-opus-4-6-20250514`). The exact-match lookup always
missed, falling back to the hardcoded 200k default. Now falls back to
prefix matching, preferring the longest/most-specific match.
- **Synthetic messages corrupt model tracking**: `/compact` and similar
commands emit assistant messages with `model: "<synthetic>"`. These were
updating `lastAssistantModel`, causing the next `usage_update` to miss
the `modelUsage` lookup and fall back to the 200k default. Now filters
out `<synthetic>` models.
- **Stale usage after compaction**: No `usage_update` was sent on
`compact_boundary`, so clients kept showing the pre-compaction context
size (e.g. `944k/1m`) right after "Compacting completed" until the next
full turn. Now sends `used: 0` immediately on compaction. This is a
deliberate approximation — the exact post-compaction size isn't known
until the SDK's next API call, which replaces it within seconds. The
alternative (no update) is worse UX: showing a full context bar right
after compaction.

Eight new tests cover: token sum correctness, current-model context
window lookup, model switching, subagent isolation, prefix matching in
both directions, and synthetic message filtering.

Note: `bin/test` (local CI validation script) is cherry-picked from
#353.

Would fix `agent-shell`'s usage indicator which currently has to defend
against this broken math:
xenodium/agent-shell#364

## Testing

- [x] Manual riding with it for a bit.

---------

Co-authored-by: Tim Visher <194828183+timvisher-dd@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Steffen Deusch <steffen@deusch.me>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant